Add a pre-commit script to detect missing i18n implementations #9428

pidgezero-one · 2024-06-13T01:10:39Z

Adds a pre-commit script that finds HTML that might be missing i18n implementation

Technical

Parses the contents of html files in specific directories to detect the line and column numbers for tags that might be missing i18n implementation.

Mostly a series of loops to ensure that the provided indexes are correct.
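As a rough illustration of the idea (not the script's actual regex), the sketch below scans each line for a closing `>` that is immediately followed by literal text and reports a one-based line and column. The pattern and the `find_candidates` name are assumptions made for this example only.

```python
import re

# Hypothetical pattern: a ">" closing a tag, followed (after optional
# whitespace) by a plain ASCII letter, so "$var" and "</tag>" do not match.
UNTRANSLATED = re.compile(r">\s*(?=[A-Za-z])")


def find_candidates(html: str) -> list[tuple[int, int]]:
    """Return one-based (line, column) pairs for each suspicious '>'."""
    hits = []
    for line_number, line in enumerate(html.splitlines(), start=1):
        for match in UNTRANSLATED.finditer(line):
            # match.start() is zero-based, so add 1 for a one-based column.
            hits.append((line_number, match.start() + 1))
    return hits


sample = (
    "<div>\n"
    '  <span class="count">Search within</span>\n'
    '  <p>$_("Hello")</p>\n'
    "</div>"
)
print(find_candidates(sample))
```

Only the second line of `sample` is reported, because its text is not wrapped in a `$`-prefixed template call.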

Testing

I've been adding junk data to html files and running both pre-commit run and pre-commit run --all-files, and it looks like the scope of those two different commands is correct. However, the regex approach returns quite a lot of what I think are false positives.

Screenshot

Stakeholders

@rebecca-shoptaw

pidgezero-one · 2024-06-13T01:34:04Z

@rebecca-shoptaw @cdrini

So far this appears to be working as expected according to the issue description, except that I'm not sure how to get it to run against only the PR's files instead of all files in the action runner. However, my regex approach flags quite a lot of what I think are false positives (example: https://results.pre-commit.ci/run/github/69609/1718241262.NJ_4UJGwTqOunVrs7ivFkA)

Some examples:

1. I think this should ignore <code> elements.
2. Should it be considered an error when the contents of an element are a template variable? Example: ERRO openlibrary/macros/SearchResults.html:23:36 <a href="$book.key/$title.replace(' ', '-')">$book.title_prefix $title</a></span> or ERRO openlibrary/templates/covers/book_cover_single_edition.html:18:16 <div class="BookTitle">$:macros.TruncateString(title, 70)
3. I'm not sure what the meaning is of \$$, for example: ERRO openlibrary/templates/admin/sponsorship.html:69:6 <td>\$$('{:,.2f}'.format(summary['avg_scan_cost'] / 100.))</td> - is this something else that should avoid flagging?
4. There are some instances where the same error is flagged twice when an outer element is on the same line as an inner element and they're separated by a superfluous space:

ERRO openlibrary/templates/subjects.html:79:12  <span class="count">        <a href="/search?${page.subject_type}_facet=$page.name.replace('&','%26')">Search within $page.name</a></span>
ERRO openlibrary/templates/subjects.html:79:40  <a href="/search?${page.subject_type}_facet=$page.name.replace('&','%26')">Search within $page.name</a></span>

(I am assuming that this is an indication that the space should be deleted and not that the script should account for this?)

5. I don't have a good answer for tags that are valid inside i18n functions, like <strong> or <em> or maybe <span>. I thought about ignoring any result that matches those three tags, but then that would miss things like <div><span>untranslated text</span></div>. I've also considered stripping those tags from the text before processing it, but that would cause incorrect position indexes in the error messages.

Please let me know what your thoughts are on the above and I'll be happy to make any changes!

rebecca-shoptaw · 2024-06-14T13:45:30Z

@pidgezero-one This is looking great!!
To respond in order:

I think this should ignore <code> elements.

+1 re: ignoring <code> elements, that's a good catch

Should it be considered an error when the contents of an element are a template variable? Example: ERRO openlibrary/macros/SearchResults.html:23:36 <a href="$book.key/$title.replace(' ', '-')">$book.title_prefix $title</a></span> or ERRO openlibrary/templates/covers/book_cover_single_edition.html:18:16 <div class="BookTitle">$:macros.TruncateString(title, 70)

I agree that template variables prefixed with $ should be ignored. Ideally, any $ that is not followed by a ( (the syntax intended to trigger WARN) should be fine

I'm not sure what the meaning is of \$$, for example: ERRO openlibrary/templates/admin/sponsorship.html:69:6 <td>\$$('{:,.2f}'.format(summary['avg_scan_cost'] / 100.))</td> - is this something else that should avoid flagging?

I believe that's an escaped $ associated with a dollar amount (scan cost), a probably very unusual edge case, but we should avoid flagging it
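One way to read the three rules discussed so far is as a small classifier over the text that follows a tag's closing `>`. This is only a sketch under that reading: the `classify` function and its return labels are hypothetical, and the real script may order or express these checks differently.

```python
import re


def classify(text: str) -> str:
    """Classify the text that follows a tag's closing '>' (illustrative only)."""
    stripped = text.lstrip()
    if stripped.startswith("\\$"):
        return "IGNORE"  # escaped literal dollar sign, e.g. a price
    if stripped.startswith("$("):
        return "WARN"  # the $( syntax the thread says should trigger WARN
    if stripped.startswith("$"):
        return "IGNORE"  # template variable or call such as $_() or $:macros
    # Anything else is an error only if it contains at least one ASCII letter.
    return "ERR" if re.search(r"[A-Za-z]", stripped) else "IGNORE"


print(classify("$book.title_prefix $title"))
print(classify("Search within $page.name"))
```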

(I am assuming that this is an indication that the space should be deleted and not that the script should account for this?)

+1, I think you can go ahead and delete the space! I imagine this script is going to catch a number of random little errors like that which we can just fix

I don't have a good answer for tags that are valid inside i18n functions, like <strong> or <em> or maybe <span>. I thought about ignoring any result that matches those three tags, but then that would miss things like <div><span>untranslated text</span></div>. I've also considered stripping those tags from the text before processing it, but that would cause incorrect position indexes in the error messages.

@cdrini has a good solution for this which he can describe in more detail! I believe that the idea is that once you've found a correctly translated string, you check whether any untranslated strings are inside it and ignore those? Or something along those lines, he can go through it fully.

And re: how to just run on changed files, I believe pre-commit will automatically just pass changed files to the function, i.e. i18n_checker.py foo1.html foo2.html if running following an HTML commit. Again CC @cdrini for how exactly the process works, pre-commit itself should be doing most of the heavy lifting for you!
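For context, pre-commit appends the matching filenames to the hook's entry command, so the script can simply read them as positional arguments. The sketch below shows that shape with a placeholder check; `check_file`, the `TODO-i18n` marker, and the file names are invented for the example and are not part of the PR.

```python
import tempfile
from pathlib import Path


def check_file(path: Path) -> int:
    """Placeholder check: count lines containing the marker 'TODO-i18n'."""
    text = path.read_text(encoding="utf-8")
    return sum("TODO-i18n" in line for line in text.splitlines())


def main(filenames: list[str]) -> int:
    """Return the exit code: 1 if any passed file has a finding, else 0."""
    errors = sum(check_file(Path(name)) for name in filenames)
    return 1 if errors else 0


# Demo: simulate pre-commit passing a single staged file to the hook.
with tempfile.TemporaryDirectory() as tmp:
    staged = Path(tmp) / "foo1.html"
    staged.write_text("<p>TODO-i18n</p>\n", encoding="utf-8")
    print(main([str(staged)]))
```

In a real hook script the last step would be something like `sys.exit(main(sys.argv[1:]))`, so that a non-zero return value fails the commit.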

pidgezero-one · 2024-06-14T14:45:34Z

Thank you @rebecca-shoptaw ! I've added some changes that will handle the first four points and will need to think a bit more about how to get the fifth to work.

When I run pre-commit run locally, it only runs against staged files, but in the CI pipeline it looks like it's running for all files. I'm not too familiar with the CI pipeline yet so I'm not sure how to fix that!

rebecca-shoptaw · 2024-06-14T14:56:20Z

@pidgezero-one Great, thank you! Definitely worth waiting for Drini's input on number 5, he has a solid method he thought through for it some time ago and can easily pass along. Very glad it's running against staged files locally, we can investigate re: the CI.

cdrini · 2024-06-14T18:32:24Z

That all looks good! For 5, I'd say if the line/char with the > prefix error has a $: somewhere on the line before the error, then we can skip it, since anything that contains html will need a $: before it (e.g. $:_('Stay <strong>strong</strong>')). That should let us avoid these with very few false positives!
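A minimal sketch of that heuristic, assuming the error position is a one-based column on a single line; the function name `is_inside_html_i18n` is made up for this example.

```python
def is_inside_html_i18n(line: str, error_column: int) -> bool:
    """True if '$:' occurs on the line before the one-based error column."""
    return "$:" in line[: error_column - 1]


translated = "<p>$:_('Stay <strong>strong</strong>')</p>"
# One-based column of the '>' that closes the inner <strong> tag.
column = translated.index("<strong>") + len("<strong>")
print(is_inside_html_i18n(translated, column))
```

The trade-off is the one noted in the comment: a `$:` anywhere earlier on the line suppresses the finding, so an unrelated `$:` could hide a real error, but such cases are expected to be rare.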

pidgezero-one · 2024-06-14T19:37:58Z

Thank you for the guidance @rebecca-shoptaw and @cdrini! 😊 I believe the next challenge is that it flags elements which contain punctuation and no words - do we want to go as far as to treat those as errors (or even warnable if encapsulated in $())?

cdrini · 2024-06-14T20:08:47Z

Wow the results for this are looking so great!! This is going to be so useful 😊

Hmmm can you give some examples? I saw a few with × and   and ←. Maybe we can erase these if they're the thing that immediately follow the failing >? Something like replace (×| |←)+\s* after the failing > with '', and see if it still fails? Would that be too difficult?
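A rough sketch of that suggestion: strip any leading symbols from the text after the failing `>` and re-run a check on what is left. The symbol list, the `still_fails` name, and the simplified re-check (any remaining character that does not start a tag or a `$` expression) are all assumptions for illustration, including the guess that the middle symbol is a non-breaking space.

```python
import re

# Symbols assumed not to need translation; the list is illustrative.
LEADING_SYMBOLS = re.compile(r"^(?:×|\u00a0|←|\s)+")


def still_fails(text_after_gt: str) -> bool:
    """Drop leading symbols, then report whether untranslated text remains."""
    remainder = LEADING_SYMBOLS.sub("", text_after_gt)
    # Simplified stand-in for the real check: fail if the remainder starts
    # with something other than a tag ('<') or a template expression ('$').
    return bool(re.match(r"[^<$]", remainder))


print(still_fails("× Close"))
print(still_fails("×</a>"))
```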

cdrini · 2024-06-14T20:09:04Z

Oh also can you try running the script with time? I'm curious to see how long it takes!

pidgezero-one · 2024-06-15T04:35:36Z

@cdrini So I think this might actually be a trivial problem to solve with the \p{L} operator, but that would require importing the pip regex library (the standard re lib does not support it). Do you think it'd be worth introducing a new dependency?

I think that access to that operator could be useful to have in general (I've used it in other projects for regexes that needed to look for non-romance words, for example) but wanted to get your (and @rebecca-shoptaw )'s opinions before doing anything drastic.

And when you say running the script with time, is that a pre-commit argument or do you mean literally recording timestamps within the script?
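For reference, a `\p{L}`-style test is possible without a new dependency: `str.isalpha()` is documented as true for characters whose Unicode general category is a letter category. The helper below is a sketch of that alternative, not something the PR is known to use, and `contains_letter` is a hypothetical name.

```python
def contains_letter(text: str) -> bool:
    """True if any character is a Unicode letter (general category L*)."""
    return any(ch.isalpha() for ch in text)


print(contains_letter("×"))
print(contains_letter("Поиск"))
```

An element whose text has no letters at all (only punctuation, arrows, or digits) would then be a candidate for skipping.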

cdrini · 2024-06-15T12:14:08Z

Hmm, how would \p{L} help in this case?

pidgezero-one · 2024-06-23T01:52:55Z

.pre-commit-config.yaml

+      entry: python ./scripts/detect_missing_i18n.py
+      types: [html]
+      language: python
+      verbose: true


I've added this so that SKIP lines will still show even when the exit code is 0. Pre-commit will hide the script output if the exit code is 0.

pidgezero-one · 2024-06-24T12:55:54Z

TIL about require_serial in pre-commit! The first problem is fixed now (and I can also see now that the total time to run this script is 0.20!!!), but the discrepancies between CI and local behaviour are still an issue.

cdrini

Ok looking good! Ah this is failing locally, the text is just hidden since there's a lot of output. The bug is related to when the errcount is incremented! Should be super close now 😊

Run pre-commit run detect-missing-i18n --all-files then echo $? to see the status code.

I'm going to be offline for the next spell but this is the last bug then we should be good to go! Would you mind taking over for me on this one @rebecca-shoptaw ?

cdrini · 2024-06-25T13:06:57Z

scripts/detect_missing_i18n.py

+            elif includes_error_attribute:
+                char_index = includes_error_attribute.start()
+                errtype = Errtype.ERR
+                errcount += 1


Ah this is where the bug is! The errors are being counted even when we continue down below. We should only +1 at the end after we run all the continues.
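The fix described here can be sketched as follows. `Errtype.ERR` comes from the diff above; `Errtype.WARN`, the `count_errors` function, and the `(type, skipped)` input shape are assumptions used to keep the example self-contained.

```python
from enum import Enum


class Errtype(Enum):
    WARN = "WARN"
    ERR = "ERR"


def count_errors(findings: list[tuple[Errtype, bool]]) -> int:
    """Count ERR findings, ignoring any that a later check filters out.

    Each finding is (type, skipped), where 'skipped' stands in for the
    conditions that make the real loop hit 'continue'.
    """
    errcount = 0
    for errtype, skipped in findings:
        if skipped:
            continue  # filtered-out findings must not affect the exit code
        if errtype is Errtype.ERR:
            errcount += 1  # incremented only after all the skip checks
    return errcount


print(count_errors([(Errtype.ERR, True), (Errtype.ERR, False)]))
```

Incrementing at the point of detection, as in the diff, counts findings that are later skipped, which is why the hook could exit non-zero with no visible errors.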

pidgezero-one · 2024-06-25T13:14:34Z

Run pre-commit run detect-missing-i18n --all-files then echo $? to see the status code.

I think there's still a discrepancy - I'm also referring to the case where every file is skipped. Up until my most recent changes, pre-commit run detect-missing-i18n --all-files skips every file and shows an error code of 0 locally using echo $?, but skipping every file in the CI exits with code 1.

pidgezero-one · 2024-06-25T13:36:13Z

Figured it out - the CI is running the script against files I hadn't yet pulled locally and detecting errors in them. I've added all outstanding files to the exclude list so we can start tackling them. All checks pass now!

cdrini

Niiiice good thinking! This code lgtm! 😊 @rebecca-shoptaw if you could give it a quick sanity check QA, and then merge, that would be awesome! (I'm afk for a spell 😁)

rebecca-shoptaw

@pidgezero-one Great work on this!!
QA:
Tried running pre-commit run --all-files with no i18n formatting changes:
✅ Detect missing i18n check passes
❓ Output for all skipped files appears, but if this only happens for --all-files which seems to be the case, that's fine with me and could be useful
❓It would be awesome if the skipped files comment was even clearer re: the process to fix the problem(s), i.e. "remove the file from exclude list then run the script again to see what the issue is"

Tried running pre-commit with a test commit after making an i18n formatting error in a non-skipped file:
✅ Successfully got an error message when un-i18n-syntaxing both basic text and complex edge cases
✅ Did not see skipped files output

Tried running pre-commit with a test commit after adding new non-i18n-syntaxed text:
✅ Successfully got an error message
✅ Error message disappeared and commit succeeded when I fixed the syntax

Looks great to me! I'm happy to approve a merge, and I'm thinking that once it's merged I'll assign myself a deep dive into those skipped files to try to fix as many as I can, and can bundle a minor instructions clarification into that. 🙂

pidgezero-one · 2024-06-25T17:10:02Z

Thank you @rebecca-shoptaw ! 😊

❓ Output for all skipped files appears, but if this only happens for --all-files which seems to be the case, that's fine with me and could be useful

Yep, this is the case! Regardless of whether a file is skipped or errored, it'll only show if that file was passed to the script, so without --all-files it would only apply to staged files.

❓It would be awesome if the skipped files comment was even clearer re: the process to fix the problem(s), i.e. "remove the file from exclude list then run the script again to see what the issue is"

By "skipped files comment" what does this refer to? A comment in the code, or the script's output, or the instructions I put in the wiki?

I'll be happy to address that ASAP so it can be merged - unfortunately the longer this PR remains open, the more updates I'll have to make to it as other PRs are merged with HTML changes that will set off the CI 😵‍💫

rebecca-shoptaw · 2024-06-25T17:15:54Z

@pidgezero-one No need to make any more changes! I just meant the comment in the code itself (added with "explain exclude list"), and I'll tweak it if necessary when I do my new/related PR to fix up those excluded files 🙂

I've approved the changes, we just need a staff member with merge powers to go ahead and merge

pidgezero-one · 2024-06-25T17:17:25Z

@rebecca-shoptaw Oh okay cool! Thank you so much! 😄

cdrini · 2024-06-25T20:08:10Z

Woohoo!! Thank you so much folks!!

pidgezero-one and others added 4 commits June 12, 2024 17:40

test

03d3913

progress

41bfb28

remove test exit code

4f980de

[pre-commit.ci] auto fixes from pre-commit.com hooks

6afcf66

for more information, see https://pre-commit.ci

pidgezero-one changed the title ~~9423/feat/detect missing i18n~~ WIP: Add a pre-commit script to detect missing i18n implementations Jun 13, 2024

pidgezero-one and others added 5 commits June 12, 2024 21:13

linting fix

55f9c03

remove test html

3db28eb

[pre-commit.ci] auto fixes from pre-commit.com hooks

0765d6b

for more information, see https://pre-commit.ci

didn't know multi import was bad

97c1508

Merge branch '9423/feat/detect-missing-i18n' of https://github.com/pi…

28a3abd

…dgezero-one/openlibrary into 9423/feat/detect-missing-i18n

pidgezero-one and others added 4 commits June 12, 2024 21:52

use one-based columns

eda7cac

[pre-commit.ci] auto fixes from pre-commit.com hooks

fb88e57

for more information, see https://pre-commit.ci

redundant code removal

77bd369

erroneous index

88f1bf4

next pass - ignore code blocks and escaped vars

8d95784

pidgezero-one and others added 2 commits June 14, 2024 10:54

Add special case for openlibrary/templates/books/edit.html

5439bc6

[pre-commit.ci] auto fixes from pre-commit.com hooks

d2ba81c

for more information, see https://pre-commit.ci

pidgezero-one added 2 commits June 14, 2024 11:03

I don't think this was working

b0dc371

Merge branch '9423/feat/detect-missing-i18n' of https://github.com/pi…

12f9f61

…dgezero-one/openlibrary into 9423/feat/detect-missing-i18n

exclude subtext

5e140b4

pidgezero-one commented Jun 23, 2024

pidgezero-one marked this pull request as ready for review June 24, 2024 12:51

require_serial should be true

2b5f6b4

pidgezero-one requested a review from cdrini June 24, 2024 12:57

cdrini changed the title ~~WIP: Add a pre-commit script to detect missing i18n implementations~~ Add a pre-commit script to detect missing i18n implementations Jun 25, 2024

cdrini added 3 commits June 25, 2024 12:35

Make detect_missing_i18n run over all by default

cf7dd49

Reduce detect_i18n_messages exclude list

64a76a5

Run detect_missing_i18n over all HTML

7433735

We only had two files outside of the valid dirs, and it makes this CLI a little cleaner since every specified file will now be processed instead of some silently skipped.

cdrini requested changes Jun 25, 2024

pidgezero-one and others added 4 commits June 25, 2024 09:15

Update scripts/detect_missing_i18n.py

a8d22b6

Co-authored-by: Drini Cami <cdrini@gmail.com>

update exclusions

d3b8b8f

Merge branch 'master' into 9423/feat/detect-missing-i18n

82a5086

add unmerged files

3ec6d95

cdrini assigned rebecca-shoptaw and unassigned cdrini Jun 25, 2024

pidgezero-one requested a review from cdrini June 25, 2024 13:36

cdrini approved these changes Jun 25, 2024

explain the exclude list

37ae3a3

rebecca-shoptaw approved these changes Jun 25, 2024

rebecca-shoptaw added the "Needs: Staff / Internal" label (Reviewed a PR but don't have merge powers? Use this.) Jun 25, 2024

cdrini mentioned this pull request Jun 25, 2024

Use new pre-commit script to detect and fix existing i18n syntax issues #9486

Closed

6 tasks

cdrini merged commit c546c6a into internetarchive:master Jun 25, 2024
4 checks passed

rebecca-shoptaw removed the "Needs: Staff / Internal" label Jun 25, 2024
